- The Case
- Learning Goals
- The Data Science Process
- The Tidyverse
8/15/2017
R without any packages offers ways to do most of the things we will see today. R is not a very good programming language. The Tidyverse is a set of packages that promote code which is:
When you master the Tidyverse, you spend more time thinking about your problem and less time thinking about your code.
AirBnB listings, schedule, and review text for the Boston area, for a time period…that you will check.
library(tidyverse)
listings <- read_csv('../data/listings.csv')
calendar <- read_csv('../data/calendar.csv')
Conceptually, we need to: filter to only JP listings, arrange the listings by rating, and select only the columns we want to see. Let's write some code!
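Without the pipe, one plausible version nests the calls inside-out (a sketch; the original un-piped slide code isn't reproduced here). Read it from the innermost call outward:

```r
# Nested version: filter, then arrange, then select -- but written inside-out
select(
  arrange(
    filter(listings, neighbourhood == 'Jamaica Plain'),
    desc(review_scores_rating)
  ),
  neighbourhood, name, review_scores_rating
)
```

Notice that the order you read the code is the opposite of the order the operations happen, which is exactly what the pipe fixes.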
Working with your partner, please rewrite the JP code using the pipe operator. Here's the first line to get you started:
listings %>%
filter(neighbourhood == 'Jamaica Plain')
# ...
listings %>%
filter(neighbourhood == 'Jamaica Plain') %>% # filter needs a logical test
arrange(desc(review_scores_rating)) %>% # desc() makes descending order
select(neighbourhood, name, review_scores_rating)
You are going to spend a long weekend in Back Bay with 50 of your closest friends.
Working with your partner, modify your code slightly to construct a table of the listings in Back Bay, sorted by the number of people who can stay there. You may need to use glimpse to see which columns you'll want to use.
listings %>%
filter(neighbourhood == 'Back Bay') %>%
arrange(desc(accommodates)) %>% # desc() puts the largest capacity first
select(neighbourhood, name, accommodates, price)
Modify the summary table as follows:
summary_table <- listings %>%
  mutate(price = price %>% gsub('\\$|,', '', .) %>% as.numeric(), # strip '$' and ',' before converting
         price_per = price / accommodates,
         weekly_price = weekly_price %>% gsub('\\$|,', '', .) %>% as.numeric(),
         weekly_price_per = weekly_price / accommodates) %>%
  group_by(neighbourhood, property_type) %>%
  summarize(n = n(),
            mean_rating = mean(review_scores_rating, na.rm = TRUE),
            price_per = mean(price_per, na.rm = TRUE),
            weekly_price_per = mean(weekly_price_per, na.rm = TRUE),
            capacity = sum(beds, na.rm = TRUE))
If you check summary_table, you'll notice that it has exactly one grouping column: neighbourhood. Working with your partner:

- Use mutate to construct a rank column where the top-ranked row has the highest value of n (i.e. most popular type). You might want the min_rank function and the desc function we used when sorting. What behavior do you observe?
- Sort by n using arrange. What behavior do you observe now?
- See if you can get the table sorted in descending order by rank, while keeping neighbourhoods grouped together.

ranked_summary_table <- summary_table %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank <= 3) %>%
arrange(neighbourhood, rank)
filter and summarise; joining data

How current is this data?

calendar %>%
summarise(earliest = min(date),
latest = max(date))
## # A tibble: 1 x 2
##     earliest     latest
##       <date>     <date>
## 1 2016-09-06 2017-09-05
Pretty current. But what if we want to focus on listings that have been active in the last three months?
Construct a table from the calendar data giving the listings that had a valid listed date between June 1st, 2016 and today. You determine what "valid" means in this context.
Hint: You can represent a date using the function
lubridate::mdy('6/1/2016')
## [1] "2016-06-01"
You can also use lubridate::today(). You can use max() to get the most recent date.
current_table <- calendar %>%
filter(!is.na(price),
date < lubridate::today(),
date > lubridate::mdy('6/1/2016')) %>%
group_by(listing_id) %>%
summarise(last_active = max(date))
The information we need is distributed between two tables – how can we get there?
We need a key that tells us which calendar rows correspond to which listings.
listings$id corresponds to calendar$listing_id
join

The join family of functions lets us add columns from one table to another using a key.
- x %>% left_join(y): most common, keeps all rows of x but not necessarily y.
- x %>% right_join(y): keeps all rows of y but not necessarily x.
- x %>% full_join(y): keeps all rows of both x and y.
- x %>% inner_join(y): keeps only rows of x that match in y and vice versa.

We'll use left_join for this case.
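A minimal sketch of the join (assuming the current_table object built above; since the key columns have different names in the two tables, we spell out the by argument):

```r
# Keep every listing, attaching last_active where a match exists;
# listings that never appeared in current_table get NA for last_active
active_listings <- listings %>%
  left_join(current_table, by = c('id' = 'listing_id'))
```

Listings with no recent calendar activity will show NA in the joined columns, which you can then filter out if you only want active listings.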
Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency. – Edward Tufte
A grammar is a set of components (ingredients) that you can combine to create new things. Many grammars have required components: if you're missing one, you're doing it wrong. In baking….

The grammar of graphics is the gg in ggplot2, which is part of the tidyverse.

- Data: almost always a data_frame.
- Aesthetic mapping: relation of data to chart components.
- Geometry: specific visualization type, e.g. line, bar, heatmap.
- Statistical transformation: how should the data be transformed or aggregated before visualizing?
- Theme: how should the non-data parts of the plot look?

Data, aesthetics, and geometries are the required grammatical components that you always need to specify.
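Each component maps onto a ggplot2 function call. Here is a minimal sketch using the diamonds data frame that ships with ggplot2 (not the AirBnB data), with one line per grammatical component:

```r
library(tidyverse)

diamonds %>%                   # data
  ggplot() +
  aes(x = carat, y = price) +  # aesthetic mapping
  geom_point(alpha = .1) +     # geometry
  stat_smooth(method = 'lm') + # statistical transformation
  theme_bw()                   # theme
```

We will build up exactly this kind of layered plot, one component at a time, with the listings data below.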
Does getting lots of reviews usually mean you get good reviews?
listings %>%
ggplot()
listings %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating)
listings %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point()
listings %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point(alpha = .2)
listings %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point(alpha = .2) +
theme_bw()
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point(alpha = .2) +
theme_bw()
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point(alpha = .2) +
theme_bw() +
labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality')
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating) +
geom_point(alpha = .2, color = 'firebrick') +
theme_bw() +
labs(x='Number of Reviews', y='Review Score',title='Review Volume and Review Quality')
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = review_scores_value,
y = review_scores_location,
size = number_of_reviews) +
geom_point(alpha = .2, color = 'firebrick') +
theme_bw()
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = review_scores_value,
y = review_scores_location,
fill = number_of_reviews) +
geom_tile() +
theme_bw()
The following code computes the average price of all listings on each day in the data set:
average_price_table <- calendar %>%
  mutate(price = price %>% gsub('\\$|,', '', .) %>% as.numeric()) %>% # strip '$' and ',' before converting
group_by(date) %>%
summarise(mean_price = mean(price, na.rm = TRUE))
Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.
average_price_table %>%
ggplot() +
aes(x = date, y = mean_price) +
geom_line()
Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').
summary_table %>%
filter(property_type == 'Apartment') %>%
ggplot() +
aes(x = neighbourhood, y= n) +
geom_bar(stat = 'identity')
summary_table %>%
filter(property_type == 'Apartment') %>%
ggplot() +
aes(x = reorder(neighbourhood, n), y=n) +
coord_flip() +
geom_bar(stat = 'identity')
summary_table %>%
ggplot() +
aes(x = reorder(neighbourhood, n), y=n, fill = property_type) +
coord_flip() +
geom_bar(stat = 'identity')
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating, color = property_type) +
geom_point(alpha = .5) +
theme_bw() +
labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality')
listings %>%
filter(number_of_reviews < 100) %>%
ggplot() +
aes(x = number_of_reviews, y = review_scores_rating, color = property_type) +
geom_point(alpha = .5) +
theme_bw() +
facet_wrap(~property_type) +
labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality')
listings %>%
select(number_of_reviews, contains("review_scores"), - review_scores_rating) %>%
gather(key = type, value = score, -number_of_reviews) %>%
ggplot() +
aes(x = factor(score), y = number_of_reviews) +
geom_boxplot() +
facet_wrap(~type)
We are going to make a simple business intelligence (BI) dashboard for AirBnB, using the wrangling and visualization skills we have developed in this session. You will use this dashboard to lead a meeting with decision-makers on where to prioritize host recruitment efforts.
- Open wrangle_viz/dashboard.Rmd.
- Press the knit button at the top of RStudio and observe the result. If you see a dashboard, you are good to go.
- Update the author metadata up top.
- Complete the R "code chunks" that begin with ```{r}.
- knit the dashboard one last time and place it in the shared Dropbox folder.

"In my experience, the vast majority of graphing agony is due to insufficient data wrangling." - Jenny Bryan
